HAD Phase 4: trends_lin (Eq 17/18) + R-package end-to-end parity #392
Conversation
🤖 AI review — Overall Assessment: ⛔ Blocker. Review note: this was a static review only.
P0 (joint_pretrends_test consumed-placebo drop):
The detrending math collapses dy_t to mechanical zero at t = base-1
(the "consumed" placebo whose F-2 → F-1 evolution is used by the
slope estimator). Feeding that all-zero residual into
stute_joint_pretest tripped the exact-linear short-circuit and
returned a confidently-non-rejecting p_value=1.0 — a placebo that
was actually no placebo at all. Fix: compute base_minus_1_period
early (right after period_rank/base_rank), filter it out of
pre_periods before _aggregate_for_joint_test, emit UserWarning
when the filter fires, raise ValueError when nothing testable
remains. Mirrors HAD.fit's `e=-2` drop and R's "max placebo lag
reduces by 1 with the same effect" convention.
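A minimal sketch of that filter-and-warn step, assuming illustrative
names (`pre_periods`, `base_minus_1_period`) rather than the module's
actual internals:

```python
# Illustrative sketch only; names and call site are assumptions, not the
# package's actual implementation.
import warnings

def drop_consumed_placebo(pre_periods, base_minus_1_period):
    """Drop the placebo whose evolution the trends_lin slope already consumed."""
    testable = [t for t in pre_periods if t != base_minus_1_period]
    if len(testable) < len(pre_periods):
        warnings.warn(
            "trends_lin=True consumes the placebo at base_period - 1; "
            "dropping it from the joint pre-trends test.",
            UserWarning,
        )
    if not testable:
        raise ValueError("No testable placebo horizons remain under trends_lin=True.")
    return testable
```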
Regression tests (3):
- consumed placebo dropped + warning emitted when pre_periods
includes base_period - 1
- all-consumed case (pre_periods=[base-1]) raises ValueError
- no-consumed case (pre_periods excludes base-1) emits no warning
P1 (did_had_pretest_workflow trends_lin propagation):
Added trends_lin: bool = False kwarg to did_had_pretest_workflow,
forwarded to both joint_pretrends_test and joint_homogeneity_test
on the event-study path. Front-door rejects trends_lin=True with
aggregate="overall" via NotImplementedError (the overall path's
qug + stute + yatchew block has no joint-pretest surface).
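A hedged sketch of that dispatch; the helper name and callable-passing
style are illustrative, only the aggregate/trends_lin semantics come
from the change itself:

```python
# Illustrative dispatch sketch; not the workflow's literal code.
def dispatch_joint_pretests(aggregate, trends_lin, run_pretrends, run_homogeneity):
    if trends_lin and aggregate == "overall":
        raise NotImplementedError(
            "trends_lin=True requires aggregate='event_study'; the overall "
            "path (qug + stute + yatchew) has no joint-pretest surface."
        )
    # Event-study path: the same flag is forwarded to both joint tests.
    return run_pretrends(trends_lin=trends_lin), run_homogeneity(trends_lin=trends_lin)
```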
Regression tests (2):
- workflow(aggregate=event_study, trends_lin=True) bit-exactly
matches direct joint_pretrends_test + joint_homogeneity_test
calls
- workflow(aggregate=overall, trends_lin=True) raises
NotImplementedError
P2 (parity test exercises true overall path):
The R `overall_e1` combo was being routed through Python's
aggregate="event_study" with a single horizon. Now slices the
panel to two periods (F-1, F) and routes through
aggregate="overall" — exercising the actual two-period overall
code path that the changelog and registry claim parity for.
Result-extraction logic branches on combo_name to handle the
scalar HeterogeneousAdoptionDiDResults (overall) vs the array
HeterogeneousAdoptionDiDEventStudyResults (event_study).
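Roughly what the re-routed parity cell now does; the column name, the
period variables, and the positional data argument to `fit` are
assumptions for illustration, and `HeterogeneousAdoptionDiD` is assumed
to be imported already:

```python
# Illustrative routing sketch for the overall_e1 combo.
def run_overall_e1(panel, period_col, f_minus_1, f_period):
    two_period = panel[panel[period_col].isin([f_minus_1, f_period])]
    est = HeterogeneousAdoptionDiD(design="continuous_at_zero")
    res = est.fit(two_period, aggregate="overall")  # scalar HeterogeneousAdoptionDiDResults
    return res.att, res.se, res.conf_int_low, res.conf_int_high
```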
Stats: 534 tests pass (529 prior + 5 new R1 regressions), 0
regressions. Parity tolerances unchanged: atol=1e-8 (point/SE/CI),
atol=1e-10 (Yatchew T-stat after N/(N-1) shift).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Static review only.
P1 (workflow consumed-placebo handling):
did_had_pretest_workflow(aggregate="event_study", trends_lin=True)
computed `earlier_pre = t_pre_list[:-1]` and dispatched to
joint_pretrends_test BEFORE accounting for the consumed-placebo drop
that joint_pretrends_test applies internally. On a minimal valid panel
(4 periods, F=3, base=2, earlier_pre=[1]=base-1=consumed), the
workflow propagated joint_pretrends_test's "no testable placebo
horizons remain" ValueError instead of taking the existing "no earlier
placebo → pretrends_joint=None, skip step 2" verdict path.
Fix: pre-filter t_pre_list[-2] (= base_period-1) out of earlier_pre
when trends_lin=True. If empty, the existing `len(earlier_pre) >= 1`
guard sets pretrends_joint=None and step 3 (homogeneity) still runs.
Regression test: minimal 4-period F=3 panel, trends_lin=True, asserts
workflow succeeds, pretrends_joint is None, homogeneity_joint is
populated with finite p_value.
P3 (parameter docstrings):
Added trends_lin parameter documentation to the four affected public
surfaces:
- HeterogeneousAdoptionDiD.fit
- joint_pretrends_test
- joint_homogeneity_test
- did_had_pretest_workflow
Each docstring covers: the formula, R-package mirror, requirements
(F>=3 / base-1 in panel), the consumed-placebo behavior (auto-drop +
UserWarning on direct surfaces; skip-step-2 escape on workflow), the
survey-weighting NotImplementedError contract, and bit-exact default
backcompat.
Stats: 535 tests pass (534 prior + 1 new R2 P1 regression), 0
regressions. Parity tolerances unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Static review only.
P1 (non-terminal-base methodology guard):
joint_pretrends_test(trends_lin=True) and joint_homogeneity_test(
trends_lin=True) silently accepted base_period values other than the
last validated pre-period (= F-1). Paper Eq 17 / Eq 18 specifically
anchor the detrending at F-1 with slope Y[F-1] - Y[F-2]; non-
terminal anchors compute a different slope at a different anchor —
silently changing the methodology away from the documented Eq
17/18 / R DIDHAD::did_had(trends_lin=TRUE) construction.
Fix: under trends_lin=True, require base_period == t_pre_list[-1]
inside the validator block. The workflow and HAD.fit always pass
F-1, so they're unaffected. Direct callers passing a non-terminal
base get a clear ValueError pointing at t_pre_list[-1].
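The guard is essentially a one-line check inside the validator; a
hedged sketch with illustrative names:

```python
# Illustrative guard sketch; the real validator structure may differ.
def require_terminal_base(base_period, t_pre_list, trends_lin):
    if trends_lin and base_period != t_pre_list[-1]:
        raise ValueError(
            f"trends_lin=True requires base_period == {t_pre_list[-1]!r} "
            f"(the last validated pre-period, F-1, per Eq 17/18); got {base_period!r}."
        )
```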
Regression tests (2):
- joint_pretrends_test(base=F-2, trends_lin=True) raises
- joint_homogeneity_test(base=F-2, trends_lin=True) raises
P1 (ordered-categorical observed-period base-1 lookup):
The previous base_period - 1 lookup walked period_rank (= the full
ordered-categorical level list) decrementing by rank. On panels
with unused intermediate categorical levels (e.g. dtype levels
[t1, t2, t_unused, t3, t4] with only t1..t4 observed), base=t3
under trends_lin would resolve base-1 to t_unused (rank 2), and
the slope-pivot lookup would KeyError because t_unused is not in
the data.
Fix: derive base_minus_1 from the validator's t_pre_list[-2] —
the validator builds t_pre_list from observed contiguous pre-
periods only, so it correctly skips unused categorical levels.
Both joint wrappers and the workflow now use the same observed-
period predecessor, eliminating the failure mode.
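A hedged sketch of the observed-period predecessor derivation (names
illustrative):

```python
# Illustrative sketch: derive base-1 from observed pre-periods, not dtype levels.
def observed_base_minus_1(t_pre_list):
    if len(t_pre_list) < 2:
        raise ValueError(
            "trends_lin=True needs at least two observed pre-periods (F >= 3)."
        )
    return t_pre_list[-2]  # skips unused categorical levels by construction
```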
Restructure: moved the trends_lin consumed-placebo drop from
before the validator block into it (so t_pre_list is in scope).
Added an early `n_periods < 3` gate for trends_lin so 2-period
panels get a clean error instead of silently failing downstream.
Regression test:
- joint_pretrends_test on an ordered-categorical panel with an
unused intermediate level succeeds (no KeyError on slope pivot)
Stats: 538 tests pass (535 prior + 3 new R3 P1 regressions), 0
regressions. Parity tolerances unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker (static review only). Code Quality, Performance, Maintainability, Security: no findings.
P0 (observed-period detrending delta):
The R3 P1 fix made the slope LOOKUP use observed periods, but the
detrending DELTA (`t_rank - base_rank`) still pulled ranks from the
full categorical dtype via _build_period_rank. On panels with
unused intermediate categorical levels, the same observed data
produced different (t - base) multipliers and corrupted the
joint test statistic — silent wrong statistical output.
Fix: under trends_lin=True, build a fresh observed_rank dict from
sorted(set(data_filtered[time_col].unique())) and use it for both
the base and horizon ranks in the delta computation. Mirrors
HAD.fit's `_aggregate_multi_period_first_differences` convention
(`sorted(t_pre_list + t_post_list, ...)` for the event-time rank).
Both joint wrappers fixed; workflow inherits the fix automatically.
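A hedged sketch of the observed-rank delta; the variable names
(`data_filtered`, `time_col`) follow the description above, but the
surrounding code is illustrative:

```python
# Illustrative sketch of the delta fix; not the module's literal code.
def detrended_residual(dy_t, t, base_period, slope, data_filtered, time_col):
    observed_periods = sorted(set(data_filtered[time_col].unique()))
    observed_rank = {p: i for i, p in enumerate(observed_periods)}
    delta = observed_rank[t] - observed_rank[base_period]  # (t - base) multiplier
    return dy_t - delta * slope
```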
Regression tests (2):
- joint_pretrends_test on (categorical with 2 unused levels)
produces identical cvm_stat_joint and p_value to (categorical
without unused levels) on the same observed data
- joint_homogeneity_test twin invariant (unused level between
base and post)
P3 (exact upstream-version pin):
The parity contract cited DIDHAD v2.0.0 / SHA edc09197 in
CHANGELOG, REGISTRY, and the parity test docstrings, but the
generator and test only enforced `>= 2.0.0`. Future regeneration
could silently re-anchor goldens to a newer release while docs
still cited the old version.
Fix: pin exactly DIDHAD == 2.0.0 and YatchewTest == 1.1.1 in both
the generator's stopifnot guards and the parity test's metadata
assertion. Document the bump procedure in the comments.
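A hedged Python-side sketch of the exact-pin assertion; the metadata
key names are hypothetical, only the pinned versions come from the
text above:

```python
# Illustrative exact-pin check; fixture key names are assumptions.
PINNED_R_PACKAGES = {"DIDHAD": "2.0.0", "YatchewTest": "1.1.1"}

def assert_exact_pins(golden_metadata):
    for pkg, want in PINNED_R_PACKAGES.items():
        got = golden_metadata["package_versions"][pkg]
        assert got == want, (
            f"{pkg} golden generated with {got}, expected exactly {want}; "
            "bump the generator stopifnot, this pin, and the docs together."
        )
```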
Stats: 540 tests pass (538 prior + 2 new R4 P0 regressions), 0
regressions. All 24 R-parity cells still green at atol=1e-8 /
1e-10. Note: existing categorical-level invariance tests added in
R3 still pass — they exercised correctness on simple unused-level
shifts; the R4 invariants are stricter, asserting bit-exact
identity of the joint statistics across categorical re-ordering of
the same observed panel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber).
R5 was ✅ Looks good — only P3 polish remained. All addressed:
P3 #1 — exact-pin nprobust: The parity contract runs through nprobust
numerical paths (DIDHAD's local-linear bandwidth + bias-correction
calls), so a fresh regeneration could drift if CRAN serves a newer
nprobust. Pin nprobust == 0.5.0 in both the R generator's stopifnot
guard and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.
P3 #2 — workflow docstring: did_had_pretest_workflow's top-level
docstring still said "Eq 18 linear-trend detrending is a Phase 4
follow-up", which contradicts the shipped trends_lin behavior.
Updated to describe the forwarding contract (trends_lin →
joint_pretrends_test + joint_homogeneity_test, consumed-placebo skip
path on minimal panels). Same fix on the StuteJointResult class
docstring.
P3 #3 — parity test horizon-shape assertions: Added an explicit
"missing in Python" assertion in _zip_r_python: every R-mapped event
time must be present in Python's event_times (catches future
horizon-shape regressions where Python silently drops a horizon R
requested). Added an effects+placebo row-count sanity check in
test_yatchew_t_stat_parity (uses the previously-unused
effects/placebo parametrize values to catch fixture drift).
Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
R6 was ✅ Looks good — 2 P3 polish items.
P3 #1 — version-aware repro installer: benchmarks/R/requirements.R
installed whatever CRAN currently served via install.packages, while
the generator and parity test hard-pin DIDHAD == 2.0.0 /
YatchewTest == 1.1.1 / nprobust == 0.5.0. A fresh R environment
regenerating the goldens would have the generator's
stopifnot(packageVersion == "X.Y.Z") immediately abort. Fix: add
`install_pinned_version()` helper using remotes::install_version with
`upgrade = "never"`, run it after the bulk CRAN install for
DIDHAD/YatchewTest/nprobust. Idempotent when the correct version is
already installed. Bump procedure documented in lockstep with the
generator + parity-test pins.
P3 #2 — exact-set parity event_times: _zip_r_python() previously
asserted only that R-mapped horizons were a SUBSET of Python's
event_times (missing-in-python check). Tighten to FULL SET EQUALITY:
also reject horizons present in Python but absent from R's requested
set ("extra_in_python"). This catches future event_study
horizon-selection regressions in both directions — e.g. if our
effects/placebo cap drifts and Python emits an extra row R didn't
request.
Stats: 540 tests pass, 0 regressions. Still no estimator changes —
all P3 polish on the parity / repro infrastructure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
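A hedged sketch of the tightened horizon check from P3 #2 above; the
function body is illustrative, only the two failure labels come from
the description:

```python
# Illustrative set-equality check; not _zip_r_python's literal code.
def assert_horizon_set_equality(r_event_times, py_event_times):
    missing_in_python = sorted(set(r_event_times) - set(py_event_times))
    extra_in_python = sorted(set(py_event_times) - set(r_event_times))
    assert not missing_in_python, f"missing_in_python: {missing_in_python}"
    assert not extra_in_python, f"extra_in_python: {extra_in_python}"
```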
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
R7 was ✅ Looks good — only 1 P3 docstring nit.
The trends_lin parameter docstrings on joint_pretrends_test and
joint_homogeneity_test only documented the "base_period - 1 must
exist in panel" requirement, but the actual public contract is
stricter: direct callers must also pass `base_period == F-1` (the
last validated pre-period). The PR #392 R3 P1 guard enforces this in
code; the docstring just understated it. Updated both docstrings to
state explicitly that base_period must be the canonical F-1 anchor,
not just any pre-period — and that non-terminal anchors raise
ValueError per Eq 17 / R semantics.
Stats: 540 tests pass, 0 regressions. Doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings. Execution note: static review only; the targeted tests could not be run in this container.
R8 was ✅ Looks good — only 1 P3 doc nit.
The CHANGELOG and REGISTRY parity claim said "HAD R-package
end-to-end parity" without qualifying that the harness explicitly
forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")`.
R `did_had` always evaluates the local-linear at d=0 regardless of
dose distribution; our default `design="auto"` may legitimately
resolve to `continuous_near_d_lower` or `mass_point` on dose
distributions with boundary density bounded away from zero (e.g.
Beta(2,2) at G=200), in which case the WAS estimand evaluates at a
different point and diverges from R numerically. That divergence is
methodologically defensible — our auto-detect uses more information
when boundary mass is sparse — but it means the parity test does NOT
validate the default `design="auto"` surface, only the Design 1'
surface that R also uses. Updated both wording surfaces to qualify
"on the `design='continuous_at_zero'` (Design 1') surface" and
explain the auto-detect divergence as out-of-scope-for-this-test
(not a defect).
Stats: 540 tests pass, 0 regressions. Doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
Adds `trends_lin: bool = False` keyword-only kwarg to
`HeterogeneousAdoptionDiD.fit(aggregate="event_study")`,
`joint_pretrends_test`, and `joint_homogeneity_test`. Mirrors R
`DIDHAD::did_had(..., trends_lin=TRUE)` (paper Eq 17 / Eq 18 / page 32
joint-Stute homogeneity-with-trends). Per-group linear-trend slope
estimated as `Y[g, F-1] - Y[g, F-2]`; applied uniformly as
`(t - base) × slope` adjustment that absorbs both the effect-side
detrending and the placebo-side anchor swap. Requires F ≥ 3 (panel
must contain F-2). The "consumed" placebo at our event-time `e=-2` is
auto-dropped (R reduces max placebo lag by 1 with the same effect).
Mutually exclusive with survey weighting; raises NotImplementedError.
Bit-exact backcompat for trends_lin=False (default) across all
existing surfaces.
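A hedged numeric sketch of that adjustment on a toy 4-period panel,
assuming `dy_t` denotes the group-level contrast `Y[t] - Y[base]`
(toy values and layout are illustrative):

```python
# Toy illustration of the Eq 17/18-style detrending; not the estimator's code.
import numpy as np

y = {0: 1.0, 1: 1.5, 2: 2.0, 3: 3.1}    # per-group means; F = 3, base = F-1 = 2
base = 2
slope = y[base] - y[base - 1]             # Y[g, F-1] - Y[g, F-2] = 0.5

def detrended_contrast(t):
    dy_t = y[t] - y[base]                 # raw contrast against the base period
    return dy_t - (t - base) * slope      # uniform (t - base) x slope adjustment

assert np.isclose(detrended_contrast(base - 1), 0.0)  # "consumed" placebo at e = -2
print(detrended_contrast(3))              # post-period contrast net of the trend: 0.6
```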
Adds end-to-end R-package parity test vs `DIDHAD` v2.0.0
(Credible-Answers/did_had, SHA `edc09197`). New generator
`benchmarks/R/generate_did_had_golden.R` produces
`benchmarks/data/did_had_golden.json` covering 3 paper-derived
synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method
combinations (overall, event-study, placebo, yatchew, trends_lin).
`tests/test_did_had_parity.py` asserts:
- point estimate / SE / CI bounds at atol=1e-8
- closed-form Yatchew T-stat at atol=1e-10 after a documented
`× G/(G-1)` finite-sample convention shift
on all 24 (DGP × combo) cells.
Two intentional convention deviations from R, documented in
REGISTRY.md and the parity test docstring:
(a) we report the bias-corrected point estimate (modern CCF 2018
convention); R's `Estimate` reports the conventional estimate
with the bias-corrected CI separately. Our `att` matches R's
CI midpoint.
(b) Yatchew uses paper Appendix E's literal (1/G) variance
denominator; R uses base-R `var()`'s (1/(N-1)) sample-variance
convention. Ratio is exactly N/(N-1); both converge to the same
asymptotic null.
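A quick numeric check of the stated ratio between the two variance
conventions (illustrative only, not the package's Yatchew
implementation):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=50)
pop_var = np.var(x)               # 1/G denominator (paper Appendix E convention)
sample_var = np.var(x, ddof=1)    # 1/(N-1) denominator (base-R var() convention)
assert np.isclose(sample_var / pop_var, 50 / 49)  # exactly N/(N-1)
```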
Yatchew on placebos with R's mean-independence null (`order=0`) is
not exposed in our `yatchew_hr_test` and is skipped in the parity
test; tracked as TODO follow-up.
Stats:
- 16 new direct trends_lin unit tests in test_had_pretests.py
- 24 new R-parity tests in test_did_had_parity.py
- 489 baseline + 16 + 24 = 529 tests pass, 0 regressions
- Generator script: ~280 LoC; fixture: 117 KB
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 476cc32 to fc436a5 (Compare)
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
The HeterogeneousAdoptionDiD example snippet in choosing_estimator.rst
failed the Pure Python Fallback CI job
(test_doc_snippets.py::test_doc_snippet[choosing_estimator:block7])
due to three latent drift bugs from PR #389 (docs-rtd-audit):
1. Missing aggregate='event_study' on both did_had_pretest_workflow
   and HAD.fit calls — default aggregate='overall' requires exactly
   2 periods, but the doc-snippet test framework's namespace `data`
   (built via generate_staggered_data) has 10 periods.
2. Used the namespace's generic `data` variable, which has nonzero
   dose in every period (rng.choice from {0.0, 0.5, 1.0, 2.0}). HAD
   requires D=0 for all units in at least one pre-period.
3. `print(f"Estimate: {results.att:.3f}")` formatted att as a scalar,
   but under aggregate='event_study' results.att is a numpy array.
Fix: rewrite the snippet to construct its own HAD-shape panel inline
(mirrors how block6 handles ContinuousDiD with its own data
generator); thread aggregate='event_study' through both calls;
iterate the per-horizon att array for output.
Pre-existing on origin/main; surfaced on this PR's CI re-run after
the rebase. Other failing snippets (troubleshooting:block18,
:block20, r_comparison:block6, :block7) are also pre-existing on main
but are out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
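The output part of the fix looks roughly like this; the attribute
names `event_times` and `att` follow the event-study results
described elsewhere in this PR, the rest is illustrative:

```python
# Illustrative sketch of the corrected snippet's output step.
def print_event_study_atts(results):
    # Under aggregate="event_study", results.att is a per-horizon array,
    # so iterate rather than formatting a scalar.
    for e, att_e in zip(results.event_times, results.att):
        print(f"e={e}: ATT={att_e:.3f}")
```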
Mirrors R YatchewTest::yatchew_test(order=0). Closes the placebo
Yatchew R-parity gap from PR #392.
- New keyword-only `null: Literal["linearity", "mean_independence"]`
  on `yatchew_hr_test` (default `"linearity"` is bit-exact backcompat).
- `"mean_independence"` fits intercept-only OLS (residuals =
  dy - mean(dy)); the downstream sigma2_diff / sigma2_W / sort-by-d
  machinery is shared.
- Wired through both unweighted and survey-weighted code paths (4-arm
  dispatch on (null × weighted)).
- `YatchewTestResults` gained `null_form: str = "linearity"` field;
  `summary()` renders the correct null-hypothesis title; `__repr__`
  and `to_dict()` updated.
- `tests/test_did_had_parity.py::TestYatchewParity` removed the
  placebo skip; routes effect rows through `null="linearity"`
  (R order=1) and placebo rows through `null="mean_independence"`
  (R order=0); both modes share the documented `× G/(G-1)`
  finite-sample convention shift and parity holds at `atol=1e-10`.
- New `TestYatchewHRTestMeanIndependence` class (15 tests) covering
  happy path, naive Python baseline at `atol=1e-12`,
  population-variance closed form, invalid value, default backcompat,
  mode-agnostic tie/constant-d rejection, NaN handling, weighted
  reduction at w=ones(G) at `atol=1e-14`, weighted non-uniform
  baseline, default-under-weights, survey×null orthogonality, the
  (linearity, weighted) baseline (4-arm coverage), zero/replicate-
  weight rejection, and G<3 mode-agnostic. One additive backcompat
  case in each of `TestYatchewHRTest` and `TestYatchewHRTestSurvey`.
- REGISTRY.md HAD § Yatchew note: TODO marker replaced with shipped
  description. CHANGELOG.md and TODO.md updated.
Patch-level (additive keyword-only kwarg + additive dataclass field
with default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
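A hedged sketch of the null-dependent residual step; the helper name
is hypothetical, and the downstream sigma2_diff / sigma2_W machinery
is unchanged and not reproduced here:

```python
# Illustrative residual step for the two nulls; not the package's internals.
import numpy as np

def null_residuals(dy, d, null="linearity"):
    if null == "linearity":
        # OLS of dy on [1, d]; residuals feed the shared differencing machinery.
        X = np.column_stack([np.ones_like(d), d])
        beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
        return dy - X @ beta
    if null == "mean_independence":
        # Intercept-only fit (R order=0): residuals are just demeaned dy.
        return dy - dy.mean()
    raise ValueError(f"unknown null: {null!r}")
```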
Summary
- `trends_lin: bool = False` keyword-only kwarg on
  `HeterogeneousAdoptionDiD.fit(aggregate="event_study")`,
  `joint_pretrends_test`, `joint_homogeneity_test`. Mirrors R
  `DIDHAD::did_had(..., trends_lin=TRUE)` (paper Eq 17 / Eq 18 /
  page 32 joint-Stute-with-trends).
- `DIDHAD` v2.0.0 (Credible-Answers/did_had, SHA `edc09197`): 24
  parity tests across 3 paper-derived synthetic DGPs (Uniform,
  Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall,
  event-study, placebo, yatchew, trends_lin).

Tolerances achieved
- `atol=1e-8` (full pipeline through `nprobust` numerical paths).
- `atol=1e-10` after a documented `× G/(G-1)` finite-sample
  convention shift.

Two intentional convention deviations from R (documented in REGISTRY.md)
(a) Point estimate convention: we report the bias-corrected point
    estimate `att = (mean(ΔY) - tau.bc) / mean(D)` (modern CCF 2018
    convention); R's `Estimate` column reports the conventional
    `(mean(ΔY) - tau.us) / mean(D)` with the bias-corrected CI
    separately. Our `att` matches R's CI midpoint; our
    `se`/`conf_int_low`/`conf_int_high` match R's `se`/`ci_lo`/`ci_hi`
    directly.
(b) Yatchew variance denominator: our `yatchew_hr_test` follows paper
    Appendix E literally with the `(1/G)` and `(1/(2G))`
    population-variance denominators. R's `YatchewTest::yatchew_test`
    uses base-R `var()`'s `(1/(N-1))` sample-variance convention.
    Ratio is exactly `N/(N-1)`; both converge to the same asymptotic
    null distribution.

Feature gap (TODO follow-up)
`yatchew_hr_test` always fits `Y ~ D` (linearity null). R's
`YatchewTest` (`order=0`) fits `Y ~ 1` (intercept-only,
mean-independence null), which is what `DIDHAD` uses on placebo rows
("non-parametric pre-trends test" per the README). Adding a
`null="mean_independence"` mode to `yatchew_hr_test` would close the
placebo Yatchew parity gap; tracked in TODO.md.

Test plan
- `tests/test_had_pretests.py` (`::TestJointPretestsTrendsLin`,
  `::TestHADFitTrendsLin`): F<3 guard, naive-Python detrending
  baseline at `atol=1e-12`, survey_design × trends_lin
  `NotImplementedError`, get/set_params idempotence, consumed-placebo
  (e=-2) drop invariant.
- `tests/test_did_had_parity.py` (`::TestPointSEParity`,
  `::TestYatchewParity`, `::TestFixtureMetadata`).
- `trends_lin=False` (default) preserves all 489 baseline test
  outputs unchanged.
- `Rscript benchmarks/R/generate_did_had_golden.R` writes
  `benchmarks/data/did_had_golden.json` (117 KB committed).

🤖 Generated with Claude Code